Principal Component Analysis and Effective K-Means Clustering

Authors

  • Chris H. Q. Ding
  • Xiaofeng He
Abstract

The widely adopted K-means clustering algorithm uses a sum-of-squared-error objective function. A detailed analysis shows the close relationship between K-means clustering and principal component analysis (PCA), which is extensively used in unsupervised dimension reduction. We prove that the continuous solutions of the discrete K-means cluster membership indicators are the data projections on the principal directions (the principal eigenvectors of the covariance matrix). New lower bounds for the K-means objective function are derived, which relate directly to the eigenvalues of the covariance matrix. Experiments on Internet newsgroups indicate that the new bounds are within 0.5-1.5% of the optimal values, and that PCA provides an effective solution for K-means clustering.

1 Principal Component Analysis

Principal component analysis (PCA) [5] in multivariate statistics is widely adopted as an effective unsupervised dimension reduction method and has been extended in many different directions. The main justification of dimension reduction is that PCA uses singular value decomposition (SVD), which gives the best low-rank approximation to the original data in the L2 norm by the Eckart-Young theorem. However, this essentially noise-reduction perspective alone is inadequate to explain the effectiveness of PCA. In this paper, we provide a new perspective on PCA based on its close relationship with the K-means clustering algorithm. We show that the principal components are actually relaxed cluster membership indicators.

Some background on PCA. The original $n$ data points in $m$-dimensional space are contained in the data matrix $X = (x_1, \cdots, x_n)$. In general, the data are not centered around the origin. We define the centered data matrix $Y = (y_1, \cdots, y_n)$, where $y_i = x_i - \bar{x}$ and $\bar{x} = \sum_i x_i / n$. The covariance matrix (ignoring the normalization factor $1/n$) is given by $S = \sum_i (x_i - \bar{x})(x_i - \bar{x})^T = Y Y^T$. The principal eigenvectors $u_k$ of $Y Y^T$ are the principal directions of the data $Y$.
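The centering and eigendecomposition steps described above can be sketched numerically. This is a minimal illustration using NumPy, not code from the paper; the helper name `principal_directions`, the synthetic data, and the random seed are assumptions made for the example. It also checks the Eckart-Young statement for a rank-1 approximation in the spectral (L2) norm.

```python
import numpy as np

def principal_directions(X):
    """Center an m x n data matrix X (columns are points) and return
    (Y, eigenvalues, U), with eigenvalues of Y Y^T in descending order."""
    x_bar = X.mean(axis=1, keepdims=True)   # mean point, shape (m, 1)
    Y = X - x_bar                           # centered data matrix
    S = Y @ Y.T                             # covariance, without the 1/n factor
    lam, U = np.linalg.eigh(S)              # eigh returns ascending eigenvalues
    order = np.argsort(lam)[::-1]
    return Y, lam[order], U[:, order]

# Synthetic data (an assumption for illustration): 50 points in 3 dimensions,
# with different spreads along each axis and a non-zero mean.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 50)) * np.array([[3.0], [1.0], [0.2]]) + 5.0

Y, lam, U = principal_directions(X)

# The squared singular values of Y are the eigenvalues of Y Y^T.
sing = np.linalg.svd(Y, compute_uv=False)

# Rank-1 approximation of Y along the first principal direction u_1.
u1 = U[:, 0]
Y_rank1 = np.outer(u1, u1) @ Y
# Spectral-norm error of the rank-1 approximation equals the second singular value.
spectral_err = np.linalg.norm(Y - Y_rank1, 2)
```

Using `np.linalg.eigh` rather than `np.linalg.eig` is a deliberate choice here, since $Y Y^T$ is symmetric and `eigh` returns real eigenvalues and orthonormal eigenvectors.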
The principal eigenvectors $v_k$ of the Gram matrix $Y^T Y$ are the principal components; the entries of each $v_k$ are the projected values of the data points on the principal direction $u_k$. $v_k$ and $u_k$ are related via $v_k = Y^T u_k / \lambda_k^{1/2}$, where $\lambda_k$ is the $k$-th eigenvalue of the covariance matrix $Y Y^T$.

2 K-means clustering

The popular K-means algorithm [3] is an error-minimization algorithm whose objective function is the sum of squared errors.
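The relation between $v_k$ and $u_k$, and the sum-of-squared-errors objective, can be checked on a small example. The sketch below is illustrative only: the function name `kmeans_objective`, the random data, and the sign-based split of $v_1$ into two groups are assumptions for the example, not the paper's procedure.

```python
import numpy as np

def kmeans_objective(X, labels):
    """Sum of squared distances from each column of X to its cluster centroid."""
    total = 0.0
    for c in np.unique(labels):
        members = X[:, labels == c]
        centroid = members.mean(axis=1, keepdims=True)
        total += np.sum((members - centroid) ** 2)
    return total

# Synthetic data (an assumption): n = 30 points in m = 4 dimensions, as columns.
rng = np.random.default_rng(1)
X = rng.normal(size=(4, 30))
Y = X - X.mean(axis=1, keepdims=True)

# Leading eigenpair (lambda_1, u_1) of Y Y^T.
lam, U = np.linalg.eigh(Y @ Y.T)
k = np.argmax(lam)
lam_1, u_1 = lam[k], U[:, k]

# Principal component from the principal direction: v_1 = Y^T u_1 / sqrt(lambda_1).
v_1 = Y.T @ u_1 / np.sqrt(lam_1)

# v_1 should be a unit-norm eigenvector of the Gram matrix Y^T Y
# with the same eigenvalue lambda_1.
gram = Y.T @ Y
residual = np.linalg.norm(gram @ v_1 - lam_1 * v_1)

# Illustrative two-way split by the sign of v_1 (hypothetical heuristic).
labels = (v_1 > 0).astype(int)
J_one = kmeans_objective(X, np.zeros(X.shape[1], dtype=int))  # single cluster
J_two = kmeans_objective(X, labels)                            # sign-based split
```

With a single cluster the objective equals the trace of $Y Y^T$, that is, the sum of all eigenvalues; any partition into more clusters can only lower it.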

Similar resources

Comparing k-means clusters on parallel Persian-English corpus

This paper compares clusters of aligned Persian and English texts obtained from k-means method. Text clustering has many applications in various fields of natural language processing. So far, much English documents clustering research has been accomplished. Now this question arises, are the results of them extendable to other languages? Since the goal of document clustering is grouping of docum...

A New Method for Dimensionality Reduction using K-Means Clustering Algorithm for High Dimensional Data Set

Clustering is the process of finding groups of objects such that the objects in a group will be similar to one another and different from the objects in other groups. Dimensionality reduction is the transformation of high-dimensional data into a meaningful representation of reduced dimensionality that corresponds to the intrinsic dimensionality of the data. K-means clustering algorithm often do...

Mixed Qualitative/Quantitative Dynamic Simulation of Processing Systems

In this article the methodology proposed by Li and Wang for mixed qualitative and quantitative modeling and simulation of temporal behavior of processing unit is reexamined and extended to more complex case. The main issue of their approach considers the multivariate statistics of principal component analysis (PCA), along with clustered fuzzy digraphs and reasoning. The PCA and fuz...

Cluster Analysis of Electrical Behavior

In this paper, we apply clustering analysis of data mining into power system. We adapt K-means clustering algorithm to analyze customer load, analyzing similar behavior between customer of electricity, and we adapt principal component analysis to get the clustering result visible, Simulation and analysis using matlab, and this well verify cluster rationality. The conclusion of this paper can pr...

The main essence of using statistical methods for outlier detection in anomaly-based approach lies in analyzing and mining information from raw data, to improve learning

Intrusion detection is an effective mechanism to deal with challenges in network security. The rapid development in networking technology has raised the need for an effective intrusion detection system (IDS) as traditional intrusion detection methods cannot compete against the newly advanced intrusion attacks. With increasing number of data being transmitted daily to/from a network, the system ...

Publication date: 2004